Scheduling Hyperparameters to Improve Generalization: From Centralized SGD to Asynchronous SGD

Authors

Abstract

This paper studies how to schedule hyperparameters to improve the generalization of both centralized single-machine stochastic gradient descent (SGD) and distributed asynchronous SGD (ASGD). SGD augmented with momentum variants (e.g., stochastic heavy ball (SHB) and Nesterov's accelerated gradient (NAG)) has been the default optimizer for many tasks, in both centralized and distributed environments. However, advanced momentum variants, despite their empirical advantage over classical SHB/NAG, introduce extra hyperparameters to tune, and this error-prone tuning is a main barrier to AutoML. Centralized SGD: We first focus on centralized SGD and show how to efficiently schedule a large class of hyperparameters to improve generalization. We propose a unified framework called multistage quasi-hyperbolic momentum (Multistage QHM), which covers a family of momentum optimizers as its special cases (e.g., vanilla SGD/SHB/NAG). Existing works mainly study scheduling only the decay of the learning rate α, while Multistage QHM additionally allows varying other hyperparameters (e.g., the momentum factor) and demonstrates better generalization than decaying the learning rate alone. We establish convergence for general nonconvex objectives. Distributed ASGD: We then extend our theory to asynchronous SGD (ASGD), where a parameter server distributes data batches to several worker machines and updates parameters by aggregating the batch gradients from the workers. We quantify the asynchrony between different workers (i.e., staleness), model the dynamics of ASGD iterations with a stochastic differential equation (SDE), and derive a PAC-Bayesian generalization bound for ASGD. As a byproduct, we show that moderate staleness helps ASGD generalize better. Our scheduling strategies have rigorous theoretical justifications rather than relying on blind trial-and-error: we prove why they decrease the derived generalization errors in both cases. They simplify the tuning process and empirically beat competitive optimizers in test accuracy. Our code is publicly available at https://github.com/jsycsjh/centralized-asynchronous-tuning.
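To make the Multistage QHM framework concrete, the sketch below implements the standard quasi-hyperbolic momentum recursion with piecewise-constant (stage-wise) hyperparameters: learning rate α, momentum factor β, and weighting factor ν. The `stages` format and the `grad_fn` interface are illustrative placeholders, not the authors' API, and the paper's theoretically derived schedules are not reproduced here.

```python
import numpy as np

def multistage_qhm(grad_fn, theta, stages):
    """Minimal QHM loop with piecewise-constant (multistage) hyperparameters.

    stages: list of dicts, each with keys
        'steps' : number of iterations in this stage
        'alpha' : learning rate
        'beta'  : momentum (discount) factor
        'nu'    : weighting factor between the raw gradient and the buffer
    Setting nu = 0 recovers vanilla SGD; nu = 1 recovers (normalized) SHB.
    """
    theta = np.asarray(theta, dtype=float)
    g_buf = np.zeros_like(theta)                      # momentum buffer g_t
    for stage in stages:
        alpha, beta, nu = stage['alpha'], stage['beta'], stage['nu']
        for _ in range(stage['steps']):
            grad = grad_fn(theta)                     # stochastic gradient
            g_buf = (1.0 - beta) * grad + beta * g_buf            # g_{t+1}
            theta = theta - alpha * ((1.0 - nu) * grad + nu * g_buf)
    return theta
```

For example, `stages = [dict(steps=5000, alpha=0.1, beta=0.9, nu=0.7), dict(steps=5000, alpha=0.01, beta=0.99, nu=0.7)]` decays the learning rate and increases the momentum factor between stages, which is the kind of joint schedule (beyond learning-rate decay alone) that the abstract refers to.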


Similar Articles

Faster Asynchronous SGD

Asynchronous distributed stochastic gradient descent methods have trouble converging because of stale gradients. A gradient update sent to a parameter server by a client is stale if the parameters used to calculate that gradient have since been updated on the server. Approaches have been proposed to circumvent this problem that quantify staleness in terms of the number of elapsed updates. In th...
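As an illustration of the staleness measure described above (the number of elapsed server updates), here is a toy parameter-server sketch. The class and method names, and the staleness-dependent step-size damping, are assumptions made for illustration rather than the method proposed in the cited work.

```python
class ParameterServer:
    """Toy parameter server that measures gradient staleness as the number of
    server updates applied since the worker read the parameters."""

    def __init__(self, theta):
        self.theta = theta
        self.version = 0                       # global update counter

    def read(self):
        # A worker pulls parameters together with the current version tag.
        return self.theta, self.version

    def apply_gradient(self, grad, worker_version, lr=0.1):
        staleness = self.version - worker_version      # elapsed updates
        # Example mitigation (illustrative): shrink the step for stale gradients.
        self.theta = self.theta - lr / (1 + staleness) * grad
        self.version += 1
        return staleness
```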


Improving Generalization Performance by Switching from Adam to SGD

Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to Stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switc...
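A minimal PyTorch-style sketch of such a hybrid strategy, switching from Adam to SGD at a fixed epoch; the switch point and learning rates here are placeholders, and the cited work's actual switching criterion and learning-rate transfer are more involved.

```python
import torch

def train_with_switch(model, loss_fn, loader, switch_epoch=10, epochs=30):
    """Start training with Adam, then hand the parameters over to SGD with momentum."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(epochs):
        if epoch == switch_epoch:
            # Re-create the optimizer; parameter values carry over unchanged.
            optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return model
```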


The Effects of Hyperparameters on SGD Training of Neural Networks

The performance of neural network classifiers is determined by a number of hyperparameters, including learning rate, batch size, and depth. A number of attempts have been made to explore these parameters in the literature, and at times, to develop methods for optimizing them. However, exploration of parameter spaces has often been limited. In this note, I report the results of large scale exper...


Theory of Deep Learning III: Generalization Properties of SGD

In Theory III we characterize with a mix of theory and experiments the consistency and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent in classification tasks. A present perceived puzzle is that deep networks show good predictive performance when overparametrization relative to the number of training data suggests overfitting. We describe an exp...


Statistical inference using SGD

We present a novel method for frequentist statistical inference in M-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. From a prac...
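A minimal sketch of the averaging idea in the snippet, assuming a constant step size and a simple burn-in; the proper scaling and the inference procedure itself follow the cited paper and are not reproduced here.

```python
import numpy as np

def averaged_sgd(grad_fn, theta0, step=0.05, n_iters=1000, burn_in=100):
    """Constant-step-size SGD; return the average of the post-burn-in iterates."""
    theta = np.asarray(theta0, dtype=float)
    total = np.zeros_like(theta)
    for t in range(n_iters):
        theta = theta - step * grad_fn(theta)   # stochastic gradient step
        if t >= burn_in:
            total += theta
    return total / (n_iters - burn_in)
```

Repeating this over independent runs yields a collection of averages whose spread, after the rescaling discussed in the snippet, can be used for confidence intervals.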



Journal

Journal title: ACM Transactions on Knowledge Discovery From Data

Year: 2022

ISSN: 1556-472X, 1556-4681

DOI: https://doi.org/10.1145/3544782